Rank Diversity of Languages: Generic Behavior in Computational Linguistics
Authors
Abstract
Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: "heads" consist of words whose rank hardly changes in time, "bodies" are words of general use, and "tails" consist of context-specific words whose rank varies considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.
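The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on stated assumptions: rank diversity at rank k is taken to be the number of distinct words that occupy rank k over the observed time points, divided by the number of time points; and the random walk perturbs each word's rank by a Gaussian step whose standard deviation is proportional to that rank, after which the words are re-sorted. The function names and the parameter `sigma_rel` are hypothetical and chosen only for illustration.

```python
import random


def rank_diversity(rankings):
    """Return the diversity of each rank.

    rankings[t][k] is the word found at rank k+1 at time point t.
    Assumed definition: distinct words seen at a rank / number of time points.
    """
    n_times = len(rankings)
    n_ranks = min(len(r) for r in rankings)
    return [len({r[k] for r in rankings}) / n_times for k in range(n_ranks)]


def simulate_rankings(n_words, n_steps, sigma_rel, seed=0):
    """Generate n_steps rankings of n_words with a rank-dependent Gaussian walk.

    Assumption: a word at rank k receives the score k + N(0, sigma_rel * k),
    and the next ranking is obtained by sorting the words by score.
    """
    rng = random.Random(seed)
    order = list(range(n_words))      # word ids, initially in rank order
    history = [order[:]]
    for _ in range(n_steps - 1):
        scored = [
            (k + rng.gauss(0.0, sigma_rel * k), word)
            for k, word in enumerate(order, start=1)
        ]
        scored.sort()                 # lower score means better rank
        order = [word for _, word in scored]
        history.append(order[:])
    return history


# Usage: simulate a small vocabulary and measure how diversity grows with rank.
history = simulate_rankings(n_words=200, n_steps=50, sigma_rel=0.1)
diversity = rank_diversity(history)
```

Because the step size scales with the rank, low ranks in this sketch are displaced less often than high ranks, which mirrors the head, body and tail regimes described above in a qualitative way only.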
Similar resources
Total Rank Distance And Scaled Total Rank Distance: Two Alternative Metrics In Computational Linguistics
In this paper we propose two metrics to be used in various fields of computational linguistics. Our construction is based on the supposition that in most natural languages the most important information is carried by the first part of the unit. We introduce total rank distance and scaled total rank distance, prove that they are metrics, and investigate their max and expected value...
A Comparative Study of Introduction and Discussion sections of Sub-disciplines of Applied Linguistics Research Articles
Much has been written in the past few decades about the reasons why many research articles (RAs) do not find their way into well-established academic journals. While some doubt viable comparison between "big" English-language journals (to use Swales' 2004 words) or international journals (IJs) and "small" ones published in other local languages, there are still a good many reasons to hope for t...
On the Behavior of Certain Metric on the Permutations Group
When a new metric is introduced, there is often a "hidden variable" in the similarity relation (which frequently depends on the specific area of research), so that we should always speak of similarity with respect to some property, and there is a plethora of measures in part because researchers are often inexplicit on this point. On the other hand, one should have some knowledge about the nature of t...
Exploring Sub-Disciplinary Variations and Generic Structure of Applied Linguistics Research Article Introductions Using CARS Model
This paper explores sub-disciplinary variations and generic structure of research article introductions (RAIs) within three sub-disciplines of applied linguistics (AL), namely English for Specific Purposes (ESP), Psycholinguistics, and Sociolinguistics, using Swales' (1990) CARS model. The corpus consisted of 90 RAIs drawn from a wide range of refereed journals in the corresponding sub-discipli...
Research Article Introductions: Sub-disciplinary Variations in Applied Linguistics
The present study aimed to investigate the generic organization of research article introductions in local Iranian and international journals in English for Specific Purposes, English for General Purposes, and Discourse Analysis. Overall, 120 published articles were selected from the established journals representing the above subdisciplines. Each subdiscipline was represented by 20 local and 2...
Journal title:
Volume 10, Issue
Pages -
Publication date: 2015